1 Introduction



Large-scale sequence analysis of the AD169 strain of human cytomegalovirus
(HCMV) began in this laboratory in 1984, when very little was known about the
sequence or location of genetic information in the viral genome. At that time
sequence analysis was confined to the major immediate-early gene (Stenberg et al.
1984), a region of the Colburn strain that contained CA tracts (Jeang and Hayward
1983), the L-S junction region (Tamashiro et al. 1984), and what has been termed the
transforming region (Kouzarides et al. 1983). This chapter is being written in
March 1989, when the sequence is complete apart from some polishing of certain
areas that is still in progress (manuscript in preparation). As far as we know
there are no major discrepancies in the data that might lead to the sequence
changing, although of course this cannot be ruled out. We present a preliminary
analysis of the HCMV genome and limit ourselves mainly to the potential
protein-coding content of over 200 reading frames.



2 Sequence Analysis 



The sequence has been determined by M13 shotgun cloning and chain termination
sequencing. In this random approach each base is sequenced many times on average,
so that the consensus produced should be highly accurate. The sequencing strategy
involved applying this random procedure to each HindIII fragment of the viral
genome (Oram et al. 1982). However, the high G + C content caused severe
problems, manifested as the many compressions encountered on the sequencing
gels. This entailed resequencing many clones, substituting dITP or 7-deazaGTP for
dGTP in the reactions to minimize the effect. All sequences have been determined on
both strands. Detailed accounts of the methods used are published elsewhere
(Bankier et al. 1987; Bankier and Barrell 1989). The sequences at the ends of the
genome which were not generated in the HindIII library were obtained from the
HindIII junction fragments C (equivalent to I and Q) and G (equivalent to K and Q),
which were sequenced in their entirety, and from a portion of the HindIII B (K and
H) junction fragment from the HindIII W/H end to the EcoRI site 21.2 kb
downstream (Weston and Barrell 1986) (Fig. 1). Sequences were also obtained
across all the HindIII sites. Double-stranded sequencing on appropriate overlapping
cosmid and plasmid clones (Fleckenstein et al. 1982) confirmed that the
sequence was contiguous except for an extra 393-bp fragment which was found
between HindIII T and E, and which we have named HindIII d. The final map in the
prototypical orientation of the viral genome with the HindIII fragments predicted
from the sequence is shown in Fig. 1. As the precise ends of the molecule are not
known, we have chosen to number the sequence from the start of the direct repeat
(DR1) found by Tamashiro et al. (1984). By analogy with the "a" sequence of other
herpesviruses, this is the closest feature to the end of the genome (Mocarski and
Roizman 1982; Tamashiro et al. 1984; Spaete and Mocarski 1985b). Our sequence is
numbered from base 2352 of Tamashiro et al. (1984) but reading backward on the
complementary strand. It contains a single copy of a DR1-flanked 578-bp sequence
at each end and at the junction of the internal repeats. The sequence we have
determined consists of 229 354 base pairs. The long unique region (UL) is 166 972 bp
and the surrounding repeats (IRL and TRL) are 11 247 bp each. The short unique
region (US) is 35 418 bp and is flanked by 2524-bp repeats (IRS and TRS). In the
sizes given above, IRL and IRS are considered as overlapping by one copy of the
DR1-flanked repeat unit. The long repeats are identical except for two base changes:
a C at position 5288 and a G at position 8293 are both substituted by As in the
equivalent IRL positions. The former change does not affect any predicted coding
sequences, while the latter affects TRL/IRL10 (Table 1). Two differences were also
found in the short repeats: in IRS, an A at position 189 887 and a G at position
190 332 are substituted by C and T respectively in TRS. The former difference is
silent while the latter changes a valine residue in HCMV-IRS1 to a leucine in
HCMV-TRS1.
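
The component sizes quoted above can be checked against the total by simple
arithmetic: the unique regions and their flanking repeats are summed, and one
copy of the 578-bp DR1-flanked unit is subtracted because IRL and IRS are
counted as overlapping by that unit. The short Python sketch below does this;
the variable and function names are illustrative only and are not part of the
original analysis.

```python
# Component sizes in base pairs, as quoted in the text.
UL = 166_972      # long unique region
RL = 11_247       # each long repeat (TRL and IRL)
US = 35_418       # short unique region
RS = 2_524        # each short repeat (IRS and TRS)
SHARED_UNIT = 578 # DR1-flanked unit by which IRL and IRS overlap


def genome_length() -> int:
    """Sum all components, counting the shared IRL/IRS unit only once."""
    return UL + 2 * RL + US + 2 * RS - SHARED_UNIT


print(genome_length())
```

The value printed is 229354, in agreement with the total of 229 354 base pairs
given for the sequence.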

3 Prediction of Reading Frames



Very little of the genome has been mapped in terms of its transcription or its 
expression. In order to analyze the protein-coding content of the sequence we need 
to define the criteria for the selection of the reading frames we think are most likely to 
be coding. A description of the procedures we have applied is given below. 
3.1 Criteria for Selection

Analysis of other herpesvirus genomes shows that in most regions the reading frame 
that is coding is the longest and that such reading frames are arranged end to end on 
either strand with very little noncoding sequence in between. Very few overlapping 
genes have been found although there are sometimes small overlaps at the 
beginnings and ends of genes. Thus the strategy we have adopted has been to screen 
the sequence for reading frames that are over a certain length and then to filter out 
any smaller frames that overlap larger ones by a certain amount. The cutoffs that we
have chosen are a minimum length of 300 bp (i.e., a coding potential of 100 amino
acids) and a maximum allowable overlap with a larger reading frame of 60%. This
latter figure allows for the fact that a reading frame may be open upstream of the
actual initiation codon and that this may lie under the preceding gene. There are 778
reading frames over 300 bp, of which 581 are screened out on the grounds that they
are overlapped extensively by larger frames, leaving 197 candidate protein-coding 
genes. The sequence is then examined for reading frames of less than 300 bp that may 
lie in the gaps that are left. Likely frames are selected by experience using criteria 
such as logical combinations of potential transcription signals with the reading 
frame and any potential translational start; homology to other reading frames or
known genes; and the presence of protein structural or functional motifs in the 
amino acid sequence. Codon bias can also be used as described below. The whole 
procedure will not work where genes are spliced and the exons are small. In those 
regions of the genome where the genes are highly spliced or in regions which are 
noncoding, small background noncoding reading frames will have been included 
which would otherwise have been screened out if larger coding reading frames were 
present. We think that this is particularly true in and bordering the repeat sequences 
and in certain regions of the HindIII D and E fragments. In a few cases we have
substituted a smaller frame for a larger overlapping frame where we have found 
compelling reasons to choose the former. 
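
The length and overlap screen described in this section can be sketched in a
few lines of Python. This is a minimal illustration, not the program actually
used: it assumes that overlap is measured as a fraction of the smaller frame's
length, that frames on either strand are compared by genome coordinates alone,
and that a frame is only tested against longer frames that have themselves
been kept. The `Frame` type, the function names, and the example coordinates
are all hypothetical.

```python
from typing import List, NamedTuple


class Frame(NamedTuple):
    name: str
    start: int  # 1-based inclusive genome coordinate, start <= end
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlap_bp(a: Frame, b: Frame) -> int:
    """Number of genome positions shared by two frames (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def select_frames(frames: List[Frame], min_len: int = 300,
                  max_overlap: float = 0.60) -> List[Frame]:
    """Keep frames of at least min_len bp that are not overlapped by more
    than max_overlap of their own length by a longer kept frame."""
    candidates = sorted((f for f in frames if f.length >= min_len),
                        key=lambda f: f.length, reverse=True)
    kept: List[Frame] = []
    for f in candidates:
        worst = max((overlap_bp(f, k) / f.length for k in kept), default=0.0)
        if worst <= max_overlap:
            kept.append(f)
    return sorted(kept, key=lambda f: f.start)


# Hypothetical frames: C lies wholly inside A, and D is shorter than 300 bp.
example = [Frame("A", 1, 1200), Frame("B", 1000, 1500),
           Frame("C", 100, 500), Frame("D", 1500, 1700)]
print([f.name for f in select_frames(example)])
```

Frames shorter than 300 bp, substituted frames, and spliced genes still have
to be handled by inspection, as described above.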
3.2 Codon Bias

Patterns of codon usage that could conceivably be generated only through the 
genetic code are, in the absence of any other criteria, the best indication that a 
sequence is coding for protein. The high G + C content of HCMV (57.2%) leads to
an accumulation of G and C in the third, degenerate, position of the codons. This
is because in an average amino acid sequence the excess G and C cannot be
accommodated in the first and second positions without biasing the sequence to
amino acids encoded by GC-rich codons. Figure 2 shows a G + C plot across the
entire sequence. As can be seen there is considerable variation in the G + C content
across the genome, particularly in the repeat areas, the regions bordering the
repeats, and the HindIII D fragment. Because of this variability we have not yet been
able to find a single formula that we could apply equally to all areas of the genome to 
justify further our selection of reading frames on the basis of size and position. 
However, codon bias does serve as a useful check in those areas with a high G + C 
content. 
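
The third-position G + C check referred to above amounts to counting G and C
at every third base of a candidate frame. A minimal Python sketch follows; it
assumes the input string is already in frame (first base = first codon
position) and written 5' to 3' on the coding strand, and the function name is
illustrative only.

```python
def third_position_gc(orf: str) -> float:
    """Percentage of G or C among third codon positions of an in-frame sequence."""
    seq = orf.upper()
    usable = len(seq) - len(seq) % 3   # ignore a trailing partial codon
    thirds = seq[2:usable:3]           # bases at codon position 3
    if not thirds:
        raise ValueError("sequence shorter than one codon")
    return 100.0 * sum(base in "GC" for base in thirds) / len(thirds)


# Toy example: codons ATG, GCC, AAA have third positions G, C, A.
print(round(third_position_gc("ATGGCCAAA"), 1))
```

Comparing this percentage for two frames that overlap on opposite strands is
one way of applying the check, bearing in mind the regional variation in
G + C content noted above.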



3.3 HCMV Map 



The preliminary map of 208 reading frames deduced from the sequence using the 
criteria discussed above is shown in Fig. 3. Details are given in the figure legend of 
individual frames that we have omitted from the original set of 197 (Sect. 3.1) and the
criteria for inclusion of replacement frames. Although some of the frames shown are
unlikely to be coding (for example, UL126, which overlaps the (noncoding) exon 1 of
the major immediate-early gene and part of the enhancer) we preferred to include all 
frames meeting our minimal criteria unless a more plausible alternative candidate 
could be identified. 
4 Identification of Homologs



The HCMV protein sequences were screened against the PIR (release 19.0; George
et al. 1986) and SWISS-PROT (release 8.0; Bairoch 1988) libraries using the FastA
program of Pearson and Lipman (1988). Searches were also performed against a
herpesvirus protein library including HSV-1, VZV, and EBV sequences. In these
library comparisons alignments were examined when optimized FastA scores of 90
or greater were obtained, although in some cases lower-scoring matches were also
scrutinized. Some of the HCMV sequences match numerous reading frames as a
result of compositional bias, which may be general throughout the sequence or
localized. For example, glycine-rich stretches occur in a number of reading frames,
including HCMV-UL44, 56, 102, 112, and TRS/IRS1. In most cases highly biased
matches have been excluded. Sometimes, however, these similarities are likely to
reflect functional similarities, if not homology. For example, HCMV-UL122, which
encodes an immediate-early transactivator, is similar to HSV-IE110, also an
immediate-early transactivator. The results of overall homology searches, motif
searches (Staden 1988), and comparisons of gene layout with EBV, VZV, and HSV-1
have been amalgamated in the compilation of human herpesvirus and cellular
homologs. Functions ascribed to HCMV genes or their homologs are noted in Table
1. Homologies detected to the sequenced herpesviruses are shown in Table 2. A






Fig. 3. A map of predicted open reading frames in HCMV strain AD169. Two hundred and eight
individual frames are recognized, some of which are known to be spliced. The reading frame map is drawn
in the prototype orientation below the HindIII restriction map. The diagram is scaled in kilobase pairs.
Open reading frames which overlap on the same strand are displaced in the figure. Frames are numbered
separately except for three genes for which splice sites have been precisely located (HCMV-UL36, UL37,
and UL123) (Kouzarides et al. 1988; Stenberg et al. 1984, 1985), and one gene for which the splice sites
are probably conserved with other herpesviruses (HCMV-UL89) (Costa et al. 1985). Genes which may be
spliced to upstream frames, but which are also capable of being initiated at a proximal ATG, are
numbered separately (HCMV-UL36, UL38, UL122). Frames are designated TRL, IRL, UL, TRS, IRS, or
US according to the region of the genome in which their 5' ends are located, and each of these six sets is
numbered from 1. A frame which spans the DR1 repeats (Sect. 2) and hence is capable of crossing the
genomic termini has been designated J (junction) 1. Three manifestations of this frame which differ in
their 5' and 3' termini occur, and are shown as J1L, J1S, and J1I (where L, S, and I denote long, short, and
internal respectively; see also Table 1). The "a" sequence is shown as a thin vertical line located within the
repeats. The following frames have been included in place of longer overlapping frames; the names of the
latter (not shown) are given in brackets, together with reasons for the substitution; the orientations of the
substituted frames are indicated by the direction of numbering: 1, J1L and TRL1 (TRL1X, positions
291-1361; these frames occupy the region more completely, with minimal overlap. TRL1 has a proximal
TATA box and a Kozak consensus ATG). [NB. J1L completely overlaps a frame equivalent to HKRFX
(Weston and Barrell 1986) (not shown, positions 873-43)]; 2, UL38 (UL38X, positions 51 098-52 141;
third position G + C; see Sect. 5.3); 3, UL106 (UL106X, positions 155 043-155 465; third position G + C);
4, UL112 (UL112X, positions 161 638-160 466; third position G + C; mapping data; Wright et al. 1988);
5, UL123 (UL123X, positions 172 331-172 816; overlaps major immediate-early gene exons 2 and 3); 6, J1I
and IRL1 (IRL1X, positions 189 176-188 106; see 1 above). US25X (former name HHRF1, positions
215 051-215 518; Weston and Barrell 1986) had an excessive overlap with US25 and was omitted
without another frame being substituted in its place. The small frame UL111A (marked as A) was included
because it has a Kozak consensus ATG, a transcript has been identified in the region, and it is a conserved
feature of a transforming region in HCMVs Towne and AD169 (Razzaque et al. 1988; Jahan et al. 1989).
The frame is one amino acid shorter than the Towne sequence, having a relative 3-bp deletion, but the
predicted amino acid sequence is otherwise identical.

Table 2. Homologs of HCMV reading frames in the sequenced herpesviruses. Internal HCMV-related
sequences as well as EBV, VZV, and HSV-1 homologs are listed, together with FastA scores (Pearson
and Lipman 1988). HCMV homologous families containing three or more sequences are indicated only in
Table 1. We have found from experience that FastA scores above 100 are often significant, except when
sequences are highly biased in composition. Homologs which were not identified by library searches, but
which were inferred from their collinearity with other conserved frames, are scored as P (positionally
conserved). Listings scored as P? should be regarded as tentative at best. Listings with a question mark and
a FastA score show borderline similarity in the absence of supporting evidence and should be regarded as
speculative. In most cases the highest scores above 90 were listed. Compositionally biased matches were
excluded for the following frames: HCMV-TRL/IRL4, TRL/IRL13, UL32, UL44, and UL113.
Nomenclature for EBV, VZV, and HSV-1 frames is conventional (Baer et al. 1984; Davison and Scott
1986; McGeoch et al. 1988a); the EBV sequence designated as LP (leader protein) is translated from the
spliced EBNA2 mRNA (Wang et al. 1987).

survey of HCMV proteins including map assignments in the AD169, Towne, and
Davis strain genomes has been conducted previously by Landini and Michelson
(1988).



5 IE Genes 



The activation of IE genes is the initial step in a viral program of gene expression. 
Northern hybridization studies have shown that transcription from the HCMV 
genome during the immediate early phase of productive infection is limited to 
several discrete loci, with the most active region located near one end of UL 
(DeMarchi 1981; Wathen and Stinski 1982; McDonough and Spector 1983; 
Jahn et al. 1984; Wilkinson et al. 1984). This major immediate-early (MIE) region
has been studied in several CMV strains, and unlike the bulk of the CMV genome is
CpG suppressed (Honess et al. 1989). The MIE genes encode regulatory proteins,
the expression of which requires only cellular factors, although virion components
may also play a transactivating role (Spaete and Mocarski 1985a; Stinski and
Roehr 1985). More recently two other immediate-early loci have been sequenced
and characterized in AD169 (Kouzarides et al. 1988; Weston 1988).



5.1 MIE Gene Region 

The first sequence data for this region were reported for HCMV Towne (Stenberg
et al. 1984) and showed the four-exon arrangement of the major immediate-early
(IE1) gene. Sequence analysis of the corresponding AD169 region revealed a similar
arrangement with minor differences. Only two changes were observed at the amino
acid level (Akrigg et al. 1985). The organization of the equivalent murine CMV
gene is grossly similar, but differs considerably at the sequence level (Keil et al.
1987). Analysis of the HCMV IE promoter region exposed a complex array of 21-,
19-, 18-, and 16-bp repeats upstream of the TATA and CAAT boxes (Thomsen et al.
1984; Akrigg et al. 1985). The upstream sequence demonstrates a potent enhancer
activity, detected by its ability to rescue enhancerless SV40 genomes (Boshart et al.
1985). Homology with the core enhancer sequence TGGAAAG/TGGTTTG was
noted in the 18-bp repeats and potential Sp1-binding sites were also found. The
enhancer binds cellular factors (Ghazal et al. 1987, 1988) and dissection has shown
that the 19-bp elements can mediate cAMP induction (Fickenscher et al. 1989;
Hunninghake et al. 1989). Similar enhancers were also found in murine and simian
CMVs (Dorsch-Hasler et al. 1985; Jeang et al. 1987). Nuclear factor I binding
sites are associated with the enhancer region in both human and simian CMVs
(Hennighausen and Fleckenstein 1986; Jeang et al. 1987).

Stinski et al. (1983) recognized two further IE regions beginning immediately
downstream of IE1. The IE2 region has more recently been called IE2a and a further
region recognized as IE2b (Hermiston et al. 1987; Stenberg et al. 1985). Under
immediate-early conditions, transcription of the IE2a region starts mainly from the
IE1 promoter and a set of alternatively spliced transcripts is produced. In the
predominant species the IE2a exon (HCMV-UL122 in AD169) is fused to the first
three exons of IE1. HCMV-UL122 encodes 494 amino acids following the splice
acceptor. This is in agreement with the size predicted for the IE2a exon reported for
the Towne strain by Pizzorno et al. (1988). A 1.7-kb unspliced mRNA can also
originate from a promoter proximal to the IE2a frame (which also contains a Kozak
consensus ATG; Kozak 1981). This transcript is more abundant at early and late
times postinfection (Stenberg et al. 1985). The product of the IE2a frame may be
involved in autoregulation (Pizzorno et al. 1988). A minor transcript extending into
the IE2b region has been diagrammed (Hermiston et al. 1987). We are unable to
correlate this with the AD169 sequence using the available information. However, a
potential splice donor occurs before the UL122 termination codon, and a polyA
signal at position 167 503 is consistent with the predicted end point of the Towne
transcript. It is likely that the reading frames on either side of this signal, UL119 and
UL118, are spliced together to encode a membrane glycoprotein.



5.2 HCMV US3 IE Gene 

Sequencing of the US region of HCMV revealed an enhancer element containing
five 18-bp repeats with homology to the MIE 18-bp repeats and the core enhancer
element (Weston 1988). These repeats were located in the region -80 to -270 of an
RNA cap site in the HCMV-US3 (HQLF1) gene. In the region -340 to -600 a further
set of six novel 11-bp repeats was found. A 275-bp fragment containing the 18-bp
repeats enhanced expression in an orientation-independent manner in HeLa cells,
with an efficacy equivalent to the SV40 enhancer (Weston 1988), while the MIE
enhancer 18-bp repeats have recently been shown to be involved in positive
autoregulation by IE1 (Cherrington and Mocarski 1989). The significance of the
11-bp repeats is unknown. However, a hexanucleotide consensus (TRTCGC)
derived from these repeats was noted to occur in the MIE enhancer (Weston 1988).
Transcription from the HCMV-US3 reading frame associated with the enhancer is
highly active at IE times and produces a set of differentially spliced transcripts. The
protein-coding sequence of HCMV-US3 contains signal, anchor, and N-linked
glycosylation sequences, is homologous to HCMV-US2 (HQLF2), and may also be
related to the RL11 and US6 gene families (Sect. 8).

5.3 UL37 IE Gene

A second UL IE transcription unit was identified in the region of the AD169 HindIII
J and Z fragments (Wilkinson et al. 1984). The sequence of this region together with
mapping data for three mRNAs has been published (Kouzarides et al. 1988). A 3.4-kb
IE transcript was shown to be spliced from four exons and, like HCMV-US3,
encodes a potential glycoprotein. This mRNA is 3' coterminal with a 1.65-kb
transcript which can be detected in the IE phase but is more abundant at the late
stage of infection. The predicted product of the 1.65-kb mRNA is a member of the
US22 homologous protein family (Sect. 7.2). A 1.7-kb transcript utilizing the same
promoter as the 3.4-kb mRNA is most abundant at IE times but can also be detected
late in infection. Of the mapped transcripts only this RNA contains the HCMV-UL38
(HZLF3) reading frame. However, expression of UL38 from this transcript
would require the upstream UL37 exon 1 to be bypassed; alternatively, the frame
may be read from an uncharacterized low-abundance transcript (Kouzarides et al.
1988). A 40-kDa protein synthesized in vitro from HindIII Z or J hybrid-selected
mRNA is consistent with translation from UL38 (Wilkinson et al. 1984). Although
a slightly longer reading frame completely overlaps UL38 on the opposite strand
(UL38X, not shown), analysis of third position G + C contents suggests that of the
two opposing frames UL38 is more likely to be coding (84.3% vs 62.8% G + C).



6 Early and Late Genes 



Immediate-early proteins are required to activate genes which establish the early or 
delayed early (E or DE) phase of infection, the outcome of which is the replication of 
the viral genome. Late genes are expressed at high levels after DNA replication and 
are likely to encode most of the structural and assembly proteins of the virus. The 
distinction between E and late phases is blurred for some genes, and is further 
complicated by posttranscriptional regulation of gene expression (DeMarchi 1983; 
Geballe et al. 1986a; Goins and Stinski 1986). In the following sections we attempt 
to correlate the available information on E and late genes with our sequence data. 
The organization of the following sections superficially resembles the viral timetable 
as convenient, but may be similarly inscrutable in places. 



6.1 Major Early Transcripts 

The most abundantly transcribed region of HCMV at early times postinfection is
situated in the long repeats of the virus and encodes a 2.7-kb transcript of unknown
function (Greenaway and Wilkinson 1987; Hutchinson et al. 1986; McDonough
et al. 1985). An early transcript of similar size also originates in RL of HCMV Towne
(Wathen and Stinski 1982), one copy of which can be deleted without compromising
viability in cultured human fibroblasts (Spaete and Mocarski 1987).
Greenaway and Wilkinson (1987) determined a 6220-bp sequence in HCMV
AD169 which encompasses the gene for the 2.7-kb transcript. Their sequence is
equivalent to positions 1635-7859 of Fig. 3 viewed in the opposite orientation. (We
refer only to TRL sequence positions for clarity.) It contains two ambiguities and 
differs from our sequence at nine positions. However, only one of these is located 
within the major early transcription unit; the doublet CC beginning at position 3386 
of Greenaway and Wilkinson (1987) is a triplet in our sequence. The open reading 
frame corresponding to the predicted translation product of the major 2.7-kb 
transcript as mapped by these authors is TRL/IRL4. The translational start is 
suggested to be the fourth ATG from the start of the transcript and occurs at 
position 4294 in our sequence. This is not a Kozak ATG in that it does not have a
purine at -3 or a G at +4 (Kozak 1981, 1982). However, two upstream ATG
codons fit the Kozak consensus. The first has the sequence CGGATGG and is 
followed by a stop codon after seven amino acids. The second has the sequence 
GAGATGA and begins a 35-amino-acid reading frame. These codons have been 
shown to inhibit translation from a downstream AUG and may therefore be cis- 
regulatory signals (Geballe et al. 1986a; Geballe and Mocarski 1988). Upstream 
Kozak consensus ATGs precede a number of other HCMV genes, and suggest a 
general phenomenon in HCMV translational regulation. However, this role has yet 
to be demonstrated directly and so far no products have been found for the major 
early transcript. A less-abundant 2.0-kb transcript has been mapped immediately
downstream of the 2.7-kb transcript in the Eisenhardt strain of HCMV
(Hutchinson et al. 1986). The predicted polyadenylation site is conserved in
AD169, beginning at position 6552 in our sequence. However, a similar-sized
transcript was not detected (McDonough et al. 1985). It is also not possible to 
suggest a 5' end from the Eisenhardt strain restriction map data. There are, however, 
no reading frames that might obviously be utilized in this region with the exception 
of TRL/IRL6. A minor 1.3-kb immediate-early RNA and a 1.2-kb late RNA have 
also been mapped to this general region (McDonough et al. 1985; Hutchinson 
et al. 1986); the latter is detected at early times postinfection but is most abundant in 
the late phase. The polyA signal for this message was located precisely in the
Eisenhardt strain and begins at position 6365 of our sequence (Hutchinson et al.
1986). These authors also mapped the start of the transcript by nuclease protection
and found no evidence for splicing. Further mapping and sequencing studies, the
latter performed on genomic as well as cDNA clones, were used to predict a coding
frame of 254 amino acids within the transcript (Hutchinson and Tocci 1986). The
region sequenced corresponds to positions 6300-7468 of Fig. 3 (displayed in the
IRL orientation). However, in AD169 the 254-amino-acid reading frame is
disrupted by three stop codons and two frameshifts relative to the Eisenhardt
sequence, and the AD169 sequence is identical in both repeats. Our data and those
of Greenaway and Wilkinson are in agreement for the region spanned by the putative
reading frame. We are unable to predict a reading frame which may be translated
from this message in AD169. The first Kozak ATG occurs 164 nucleotides downstream of the
transcription start predicted by Hutchinson and Tocci (1986), but is followed by a 
stop codon after 42 intervening amino acid codons. Furthermore, although 
TRL/IRL7 is located in this message, it is over 500 bp from the predicted start. If 
these differences between the Eisenhardt and AD169 strains are genuine, sequencing
from other strains would be useful in assessing their biological relevance.
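
The test applied to ATG codons in this section (a purine at -3 or a G at +4,
with the A of the ATG as +1) is easy to state as code. The Python sketch below
assumes that either feature alone is sufficient, which is how the two upstream
ATGs quoted above (CGGATGG and GAGATGA) are treated in the text; the function
name and the seven-base window convention are our own illustrative choices.

```python
PURINES = {"A", "G"}


def fits_kozak(context: str) -> bool:
    """True if an ATG context has a purine at -3 or a G at +4.

    `context` is a 7-nt window NNNATGN: three bases upstream of the ATG,
    the ATG itself (A = +1), and the base at +4.
    """
    ctx = context.upper().replace("U", "T")
    if len(ctx) != 7 or ctx[3:6] != "ATG":
        raise ValueError("expected a 7-nt window of the form NNNATGN")
    return ctx[0] in PURINES or ctx[6] == "G"


# The two upstream ATG contexts quoted in the text.
for window in ("CGGATGG", "GAGATGA"):
    print(window, fits_kozak(window))
```

Both quoted windows pass: the first through the G at +4, the second through
the purine at -3.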



6.2 Enzymes of Nucleotide and DNA Metabolism 

6.2.1 Nucleotide Metabolism

Honess (1984) postulated that differences in overall base compositions between
herpesvirus genomes reflect the ability of the viruses to modulate and utilize the
nucleotide pool available for DNA synthesis. This hypothesis appears to be borne
out in the case of the two closely related α-herpesviruses, HSV-1 and VZV. The latter
is AT rich and encodes a thymidylate synthase, which does not have a homolog in
the G + C rich HSV-1 genome (Thompson et al. 1987; McGeoch et al. 1988a). A
parallel exists in the less closely related γ-herpesviruses Epstein-Barr virus (EBV)
and herpesvirus saimiri (HVS); the latter A + T rich virus encodes thymidylate
synthase and dihydrofolate reductase, which both seem to be absent from the G + C
rich EBV (Honess et al. 1986; Trimble et al. 1988; Baer et al. 1984). All four viruses
also encode deoxyribonucleoside kinases, and hence can utilize the salvage
pathway of dNTP synthesis (McKnight 1980; Davison and Scott 1986; Littler
et al. 1986; Gompels et al. 1988a). These enzymes differ in their substrate specificity
and their main role might be to allow the exploitation of specific cell types, such as
may occur in latency. Genes for ribonucleotide reductase, a key enzyme in
deoxyribonucleotide synthesis, have been found in HSV, VZV, and EBV as well as
other herpesviruses, but have not so far been identified in HVS (Gibson et al. 1984;
Davison and Scott 1986; Nikas et al. 1986). The HCMV genome is relatively
G + C rich (Fig. 2) and it will be of interest to determine if its complement of enzymes is
consistent with the theory of Honess (1984). HCMV does not appear to encode a
thymidine (deoxyribonucleoside) kinase (TK); the position in the AD169 genome
equivalent to the TK locus in other herpesviruses is deleted relative to the other
herpesviruses (Fig. 3). However, HCMV is sensitive to the nucleoside analog
DHPG, and a resistant mutant of AD169 has been isolated which accumulates less
of the triphosphate form of the drug (Biron et al. 1986). This may indicate that a
deoxyribonucleoside kinase is encoded at some other locus.

The partial conservation of a ribonucleotide reductase (RR) homolog is more
puzzling. Mammalian cells contain an iron-tyrosyl radical enzyme, which is the type
found in herpesviruses (Sjoberg et al. 1985; Reichard 1989). The enzyme has an
α2β2 structure; the HCMV-UL45 gene product is homologous to the α (large) RR
subunit, and HCMV-UL45 is positionally conserved with the gene for this subunit
in other herpesviruses. However, the gene for the β (small) subunit does not appear
to be conserved; HCMV-UL44 is positionally analogous to the small RR gene in
other herpesviruses but encodes a set of late DNA-binding proteins (see Sect. 6.5).
The small subunit contains the active tyrosyl radical and would be essential for
function. Thus it is not clear at present if HCMV is capable of expressing a fully
active ribonucleotide reductase. Although we have used loosely defined motifs to
search all the predicted reading frames for a potential active site, no obvious
candidates were identified. Several explanations could account for this. For
example, if HCMV-UL45 is functionally conserved with the large subunit, it might
usurp the place of its cellular counterpart, which mediates allosteric control as well
as being involved in catalysis. Herpesviral reductases appear to be unregulated,
indicating that the function is either unnecessary or perhaps detrimental in the viral
context (Lankinen et al. 1982; Averett et al. 1983). It is also possible that synthesis of
one or both of the cellular subunits is upregulated during viral infection (Stinski
1977). The genes for the human RR subunits are unlinked; the α-subunit gene is on
chromosome 11 (Engstrom et al. 1985), and the β-gene on chromosome 2 (Yang-Feng
et al. 1987). Finally, it is worth mentioning that another key allosteric enzyme
of nucleotide metabolism is dCMP deaminase; this enzyme converts dCMP to
dUMP, which is the substrate for thymidylate synthase. Hence it might be an
appropriate enzyme for herpesviral repertoires, particularly those which have
devolved to an A + T bias.

6.2.2 DNA Replication 

A set of seven HSV-1 genes has been shown to be essential for the replication of an 
HSV-origin-containing plasmid (Wu et al. 1988; McGeoch et al. 1988b). The 
HCMV homologs of four of these have been identified by sequence analysis.
HCMV-UL54 encodes the DNA polymerase (Kouzarides et al. 1987a; Heilbronn
et al. 1987) and HCMV-UL57 the major DNA-binding protein (MDBP). The latter
sequence shows 72% identity over a length of 1160 aligned amino acids to the
MDBP of simian CMV (Colburn) (Anders and Gibson 1988; Anders and Gibson,
personal communication). HCMV-UL105 encodes a homolog to HSV-UL5, which
is probably a helicase enzyme (Crute et al. 1988, 1989). Helicases belong to a 
superfamily of proteins with functions in replication and/or recombination 
(Hodgman 1988). A nucleotide-binding site in UL105 (Martignetti 1987), of the 
type GxxGxGK (where x = any amino acid), is common to the other members of the 
superfamily. HCMV-UL70 is the fourth HCMV gene with an obvious replication
gene counterpart, in HSV-UL52. The product of HSV-UL52 is part of a
helicase-primase complex in HSV-1-infected cells which also contains the HSV-UL5 and
UL8 proteins (Crute et al. 1989). HCMV genes UL102 and UL101 are positionally
equivalent to HSV-UL8 and UL9 respectively, although they show no clear-cut
homology. However, HCMV-UL102 is a similar length to HSV-UL8 (798 and 750
residues respectively). HSV-UL9 encodes an origin-binding protein (Olivo et al.
1988), and the positive identification of its HCMV counterpart may require the
identification of an HCMV origin of replication.
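
A motif such as GxxGxGK can be turned directly into a pattern search over a predicted protein sequence. The following is a minimal sketch only: the function name and the example peptide are hypothetical (the peptide is invented for illustration and is not the UL105 sequence), and the search does not reproduce the method used for the analysis described here.

```python
import re

# Pattern taken from the text: G-x-x-G-x-G-K, where x is any amino acid.
# A lookahead is used so that overlapping occurrences are all reported.
NUCLEOTIDE_BINDING = re.compile(r"(?=(G..G.GK))")


def find_motif(protein: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched residues) for each occurrence."""
    return [(m.start() + 1, m.group(1))
            for m in NUCLEOTIDE_BINDING.finditer(protein.upper())]


# Hypothetical peptide, invented for illustration only.
example = "MSTLGAPGTGKSTLIRAV"
print(find_motif(example))  # [(5, 'GAPGTGK')]
```

A match of this kind only flags a candidate site; short motifs occur by chance, so the surrounding alignment still has to support the assignment.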

6.2.3 DNA Repair 

The gene for uracil-DNA glycosylase, which is involved in base excision repair, was 
identified in HSV-2 and is conserved in the sequenced herpesviruses (Worrad and 
Caradonna 1988; Baer et al. 1984; Davison and Scott 1986; Mullaney et al.
1989). The corresponding HCMV reading frame is HCMV-UL114, which is the last
frame at this end of UL with detectable homology to sequenced human herpesviruses.
A dUTPase gene is also conserved in herpesviruses, albeit less well than
uracil-DNA glycosylase (Preston and Fisher 1984; Davison and Scott 1986; Baer
et al. 1984). The HCMV homolog is HCMV-UL72.

6.2.4 Deoxyribonuclease

A deoxyribonuclease gene found in HCMV appears to be ubiquitous in herpes- 
viruses, as homologs are found in HHV-6 (Lawrence et al., unpublished results), 
EBV (Zhang et al. 1987), HSV (McGeoch et al. 1986), and VZV (Davison and
Scott 1986). The role of this enzyme is currently unknown, but it may be involved in
cleavage of viral concatemers and/or the processing of genome termini (Chou and
Roizman 1989).



6.3 Phosphotransferase

The putative phosphotransferase encoded by HCMV-UL97 is conserved in the 
human herpesviruses and distantly related to the protein kinase family (Chee et al. 
1989a; Smith and Smith 1989). Interestingly, some of the most conserved amino 
acids in protein kinases are variant in the herpesvirus sequences. One motif where 
these differences occur is shared with bacterial phosphotransferases, which vary at 
the same amino acid positions as do the herpesvirus proteins (Brenner 1987). Hence
it remains to be shown if HCMV-UL97 and its homologs are in fact conventional 
kinases. Whatever its specific role, the preservation of this gene in all of the 
recognized herpesvirus lineages and HHV-6 implies an important or indispensable 
contribution to the viral life cycle. None of the other HCMV reading frames we have
screened have detectable homology to known protein kinase motifs, which are seen
in the α-herpesvirus US-encoded kinases (McGeoch and Davison 1986).

6.4 Early Phosphoprotein Genes 

The gene for a set of phosphoproteins sharing a common N-terminus has been 
mapped by Wright et al. (1988). These authors mapped the termini of two spliced 
12-kb early transcripts, raised an antiserum against a synthetic peptide predicted 
from a 5'-terminal portion of the 5'-exon sequence (Kouzarides et al. 1983; 
Rasmussen et al. 1985a) and used this to detect four proteins of 34, 43, 50, and 
84kDa in infected cells (Wright et al. 1988). Pulse-chase experiments did not 
suggest that any of the proteins were derivative in nature. Although the mapping 
data are as yet incomplete, it would thus appear that all four proteins are coded in 
alternatively spliced mRNAs sharing a 5' exon. This exon corresponds to UL112 in
our sequence. A 279-bp portion of the UL113 frame (positions 161 503-161 781) is
flanked by potential acceptor and donor sites, and may correspond to a 280-bp exon
mapped by Staprans and Spector (1986). The downstream exons may also be
derived from UL113, which extends to position 162797. A polyA signal begins at
position 162909, but there is an alternative polyA sequence coinciding with the end
of UL113 (ATTAAA, beginning at position 162796). It therefore seems likely that
one or both of these signals indicates the end of the transcription unit. The four 
proteins were found to be predominantly contained in the nuclear fraction of 
infected cells, and were not shown to be virion structural proteins in preliminary 
studies (Wright et al. 1988). 
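
The coordinates quoted in this section can be checked with simple inclusive arithmetic. The sketch below assumes 1-based inclusive genome positions (an assumption that reproduces the 279-bp figure given in the text); the function name is hypothetical.

```python
def span_length(start: int, end: int) -> int:
    """Length in bp of a feature given 1-based inclusive coordinates."""
    if end < start:
        raise ValueError("end must not precede start")
    return end - start + 1


# Portion of UL113 flanked by potential splice sites (coordinates from the text).
print(span_length(161503, 161781))  # 279

# The alternative ATTAAA hexamer begins at 162796, so it occupies 162796-162801
# and overlaps the end of the UL113 frame at 162797.
hexamer_end = 162796 + len("ATTAAA") - 1
print(hexamer_end)  # 162801

# Distance from the end of UL113 to the start of the downstream polyA signal.
print(162909 - 162797)  # 112
```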



6.5 Late DNA-Binding Proteins

Mocarski and coworkers utilized immunological screening of a λgt11 expression
library to map a group of proteins known as the ICP36 family to the HCMV-UL44
reading frame (Mocarski et al. 1985; Leach and Mocarski 1989). The ICP36
proteins gravitate to the nucleus, include phosphorylated and glycosylated species,
and are DNA-binding proteins (Pereira et al. 1982; Gibson 1983; Mocarski et al.
1985). Regulation of HCMV-UL44 gene expression is manifested in both early and
late transcription from different TATA boxes, and delayed translation of early 
message (Leach and Mocarski 1989; Geballe et al. 1986b). The significance of this 
complex control is unclear, although it is interesting that the 3'-end of the reading 
frame is overlapped by a gene encoding a small RNA in the same orientation. This 
gene is probably transcribed by RNA polymerase III (Marschalek et al. 1989).



6.6 Capsid Proteins 

The gene for the major capsid protein (MCP) was identified by sequence homology 
to the MCP sequences of other human herpesviruses and the assignment confirmed 
immunologically (Chee et al. 1989b). The MCP is encoded by the HCMV-UL86 
reading frame. Homology searches show that the predicted protein sequence of 
another frame, HCMV-UL47, is similar to a region of the human herpesvirus major
capsid proteins corresponding approximately to positions 1080-1170 of Fig. 3 in Chee
et al. (1989b). Although this match may be fortuitous, the alignment of HCMV-UL47
to conserved capsid sequences makes it of interest. However, the sequence is
not obviously conserved in the EBV, VZV, and HSV-1 reading frames collinear with
HCMV-UL47.

A second capsid protein, which is a constituent of incomplete capsids, has been 
mapped in the UL region of three CMV strains (Robson and Gibson 1989). Several 
lines of evidence implicate this protein in DNA packaging and/or capsid assembly 
(Preston et al. 1983; Irmiere and Gibson 1985; Lee et al. 1988; Rixon et al. 1988).
The gene for the putative assembly protein is conserved in the human herpesviruses, 
and is predicted to encode proteins of 635, 605, 605, and 708 amino acids in HSV, 
VZV, EBV, and HCMV respectively (McGeoch et al. 1988a; Davison and Scott 
1986; Baer et al. 1984) (Table 1). The sequence of a 1-kb cDNA derived from the 
Colburn strain of CMV shows homology only to the 3' half of HCMV-UL80,
consistent with the 37-kDa size of the Colburn strain assembly protein which is 
probably processed at the carboxy terminus (Robson and Gibson 1989). A larger 
transcript of 1.8 kb is also encoded at this locus. The 5' portion of the HCMV-UL80
frame is conserved in the other sequenced human herpesviruses. It thus seems likely
that at least two separate proteins are encoded by HCMV-UL80, with a TATA box
at position 115992 being used to produce the assembly protein transcript (Robson
and Gibson 1989). This TATA box is located within 15 bp which are identical in
Colburn and AD169 (Necker et al. 1988 cited in Robson and Gibson 1989). It is also
noteworthy that the ATG downstream of this TATA box does not fit the Kozak 
consensus in either of the two CMV sequences. In contrast to the major DNA- 
binding protein (Sect. 6.2.2), the sequences for the putative assembly protein are 
quite divergent. The Colburn sequence from the first methionine of the predicted 
cDNA reading frame exhibits approximately 40% identity to the carboxy-terminal 
371 amino acids of HCMV-UL80. 
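
Identity figures such as those quoted above depend on how aligned positions are counted. The sketch below shows one common convention, counting only columns in which neither sequence has a gap; this convention is an assumption and the text does not state which one was used. The function name and the two short aligned strings are hypothetical.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identical residues over aligned columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which both sequences contribute a residue.
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# Invented aligned fragments, for illustration only.
row_a = "MKT-LLVA"
row_b = "MRTALL-A"
print(round(percent_identity(row_a, row_b), 1))  # 83.3
```

Dividing by the full alignment length, or by the length of the shorter sequence, would give lower values for the same alignment, so reported percentages are only comparable when the denominator is the same.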



6.7 Structural Phosphoprotein Genes

HCMV virions contain three main phosphoproteins which appear to be located in
the virion tegument (Roby and Gibson 1986). The largest of these is approximately 
150kDa in size, constitutes approximately 20% of virion protein content (Irmiere 
and Gibson 1983), and is also modified by O-linked glycosylation (Benko et al. 
1988). A 6360-bp region containing the pp150 gene sequence (which corresponds to
the reading frame HCMV-UL32) has been published and spans positions 37 157-43 516
of Fig. 3 viewed in the opposite orientation. A late 6.2-kb mRNA was mapped
in this region, and its termini delineated. Some processing at an alternative polyA
site (ATTAAA) downstream of the orthodox signal was demonstrated. The major
RNA species is predicted to encode pp150 although a range of smaller RNA species
was also detected (Jahn et al. 1987). 

The two other major phosphoproteins located in virions are pp71 and pp65, also
known as the upper and lower matrix phosphoproteins respectively. The 65-kDa
phosphoprotein is also glycosylated (Clark et al. 1984; Pande et al. 1984), and pp71
may be similarly modified. The genes for pp65 and pp71 are located in the HindIII L,
c, b region of the genome and correspond to reading frames HCMV-UL83 and
UL82 respectively. The sequence of a HindIII/BglII fragment containing these genes
has been reported, and corresponds to nucleotides 117 276-121 377 of Fig. 3 viewed
in the opposite orientation (Ruger et al. 1987). The published sequence is in error;
position 212 (121 166 in the genome) is shown as a G but should be read as a C. This
change does not affect the predicted coding sequences. Two transcripts which 
appear to be 3' coterminal were mapped in this region. They are an abundant 4-kb 
mRNA and a low-level 1.9-kb mRNA. The 5' ends of both transcripts have been 
located, but surprisingly no TATA box is proximal to the 4-kb transcription unit 
(Ruger et al. 1987). The 4-kb message should encode pp65, while the shorter mRNA 
would allow pp71 to be translated. The mRNA encoding pp65 (ICP27) in HCMV 
Towne appears to be produced efficiently both early and late in infection, but is not
translated at high levels until the late phase (Geballe et al. 1986b; but see Depto
and Stenberg 1989). The gene sequences for two further structural phosphoproteins
have been reported (Meyer et al. 1988; Davis and Huang 1985). The data of Meyer
et al. (1988) represent positions 143 791-145 191 of our sequence in the HindIII R
fragment and show the gene for a 28-kDa protein encoded by a late 1.3-kb RNA.
Martinez et al. (1989) and Martinez and St. Jeor (1986) mapped a 25-kDa protein
to the same locus and assigned a 1.6-kb late mRNA as the message. These RNAs
are likely to be initiated from one or both of two TATA boxes proximal to HCMV- 
UL99. An HCMV Towne 1.4-kb late mRNA localized to this region may also 
denote HCMV-UL99 (Pande et al. 1988). However, the Towne protein migrates as 
a 32-kDa protein. If the same frame is in fact being used, nontrivial explanations for 
the difference could be invoked at the genetic, transcriptional, and protein- 
processing levels. It is interesting to note that a minor 27-kDa species was detected 
by Pande et al. (1988) in infected cells and virions. 
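
Several published fragments in this section are related to the genome sequence "viewed in the opposite orientation", which means that position 1 of the fragment corresponds to the highest genome coordinate of the span. The mapping can be sketched as follows, assuming 1-based inclusive coordinates throughout; the function name and its parameters are hypothetical.

```python
def fragment_to_genome(pos: int, genome_start: int, genome_end: int,
                       reverse: bool = True) -> int:
    """Map a 1-based position in a published fragment to a genome coordinate.

    genome_start and genome_end are the inclusive genome coordinates covered by
    the fragment; reverse=True means the fragment reads in the opposite
    orientation to the genome sequence.
    """
    length = genome_end - genome_start + 1
    if not 1 <= pos <= length:
        raise ValueError("position lies outside the fragment")
    return genome_end - pos + 1 if reverse else genome_start + pos - 1


# Corrected base in the pp65/pp71 fragment (coordinates from the text).
print(fragment_to_genome(212, 117276, 121377))  # 121166

# Length of the published pp150 region spanning 37157-43516.
print(43516 - 37157 + 1)  # 6360
```

Both results agree with the figures given in the text, which supports the reading of the coordinates as inclusive.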

An example of a phosphoprotein gene that appears not to be conserved between 
HCMVs Towne and AD169 was mapped and sequenced from passage 36 of HCMV 
Towne (Davis et al. 1984; Davis and Huang 1985). This gene encodes an abundant 
late transcript, and immunological evidence suggests that its product is a 67-kDa 
nonglycosylated phosphoprotein found in virions. The sequenced fragment corresponds
very approximately to a region of AD169 HindIII D beginning at about
position 95 500. There appear to be significant differences between the two genomes
in this region. These include numerous point and frameshift mutations and a
deletion of 61 bp in Towne relative to AD169. A consequence of some of these
differences is the disruption of the putative Towne reading frame in AD169,
although a portion of the predicted phosphoprotein sequence is preserved in 
HCMV-UL65. The reported sequence was not determined fully on both strands, 
and not all sequenced fragments were shown to be contiguous. Hence further 
comparative sequence analysis and transcript mapping will be necessary before 
these findings can be interpreted unambiguously, particularly as the equivalent
region in AD169 contains some potential splice sites. A gene which is posttranscriptionally
regulated by an mRNA 3'-end processing event was partially sequenced and
shown to contain a potential stem-loop structure (Goins and Stinski 1986). This
sequence maps to positions 96753-97076, and may therefore correspond to the 3'
end of a transcription unit spanning HCMV-UL65. The putative stem-loop
structure in the Towne sequence is conserved in AD169, although there are three
deletions relative to AD169 clustering in the 3'-terminal 25 nucleotides of the
published sequence. 

6.8 Surface Glycoproteins 

The importance of glycoproteins as surface antigens has made the major HCMV 
glycoproteins a focus for characterization and functional studies. A total of 54 
reading frames have now been found in the sequence that have characteristics of
glycoprotein genes or of exons of glycoprotein genes. These are presented in Table 3, 
which shows the predicted signal sequences, the number of N-linked glycosylation 
sites, and the anchor sequences. Twenty-two of these frames lack either a signal or an 
anchor. In the following sections we consider two immunologically important 
glycoproteins, and two which have homology to host immunoglobulin superfamily 
proteins. Known IE glycoprotein genes and glycoprotein gene families are 
considered separately in Sects. 5 and 7 respectively. 
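
Counting potential N-linked glycosylation sites in a predicted protein is a simple pattern search for the N-X-S/T sequon. The sketch below excludes proline at the X position, which is the conventional rule but an assumption here, since the text does not say how sites were counted. The function name and the example peptide are hypothetical.

```python
import re

# N-X-S/T sequon; proline at X is excluded (conventional rule, assumed here).
# The lookahead lets overlapping sequons such as N-N-S-T be counted separately.
SEQUON = re.compile(r"(?=N[^P][ST])")


def n_glycosylation_sites(protein: str) -> list[int]:
    """Return 1-based positions of the asparagine in each potential sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(protein.upper())]


# Invented peptide, for illustration only.
example = "MNGTANPSQNNSTK"
print(n_glycosylation_sites(example))  # [2, 10, 11]
```

A sequon is only a potential site: whether it is used depends on the protein reaching the secretory pathway, which is why the signal and anchor predictions are tabulated alongside the site counts.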

6.8.1 Glycoproteins B and H

There are seven virion glycoproteins encoded by HSV-1 and one putative 
glycoprotein (US5) predicted from the sequence (McGeoch et al. 1988a). Of these,
five have counterparts in the sequence of VZV (Davison and Scott 1986) and only
two in the genome of EBV (Baer et al. 1984). In addition, EBV has the gp350/220
(BLLF1a,b), BILF1, and BLRF1 glycoproteins. The latter has a homolog in
HCMV-UL73. Of the other herpesvirus glycoproteins, only homologs to gB
(HCMV-UL55) (Cranage et al. 1986; Kouzarides et al. 1987b; Mach et al. 1986)
and gH (HCMV-UL75) (Cranage et al. 1988; Pachl et al. 1989) have been found in
the HCMV sequence, and so gB and gH are common to all of the well-studied
herpesviruses. The conservation of gH in distantly related herpesviruses (Gompels
et al. 1988b) and the production by an HSV-1 ts mutant of noninfectious virus
lacking gH (Desai et al. 1988) underpin the substantial body of immunological
evidence that gH is essential for virus infectivity. Monoclonal antibodies to HCMV
gH can neutralize virus in vitro unassisted by complement (Rasmussen et al. 1984;
Cranage et al. 1988). Antibodies to gB are also able to neutralize virus in vitro, but
require complement (Cranage et al. 1986). A virion envelope glycoprotein complex
has been shown to contain gB, but the structural nature of this entity awaits
definition (see, for example, Farrar and Greenaway 1986; Gretch et al. 1988a).
The unmodified gB precursor in AD 169 is predicted to be 102 kDa in size. This is 
processed and glycosylated to give a 145-kDa species which is proteolytically 
cleaved to produce a 55-kDa species, both of which can be detected in infected cells. 
However, the residual 90-kDa amino-terminal cleavage product is not detected 
(Cranage et al. 1986). The site of cleavage has been mapped to Arg45o in the gB of
HCMV Towne and by analogy processing of the AD 169 gB is likely to occur after 
Arg459 (Spaete et al. 1988). These authors also compare the gene and protein 
sequences of gB and find identities of 94% and 95% respectively between the two 
HCMV strains. (A similar level of conservation is found between the gH sequences of 
these strains; Pachl et al. 1989.) There appear to be noteworthy differences in the 
kinetics of gB transcription in these two strains. The AD169 gB transcripts are 
produced late in infection (Kouzarides et al. 1987b) while the Towne gB mRNA is 
of the early class. However, in HCMV Towne infected cells gB is not detected 
immunologically until late in infection (Rasmussen et al. 1985b), implying that the 
two strains might use different strategies to achieve a similar result in the regulation 
of gB expression. 

6.8.2 HLA Homolog 

The identification of an HCMV glycoprotein with homology to class I major 
histocompatibility (MHC) antigens has implications for host-virus interactions 
(HCMV-UL18, Beck and Barrell 1988). The crystal structure of a human class I 
histocompatibility molecule (HLA-A2) has been solved (Bjorkman et al. 1987a), 
making it possible to predict that the HLA homolog is likely to have three 
extracellular domains analogous to the class I α1-, α2-, and α3-domains. The latter
contains a β2-microglobulin (β2m)-binding loop which is partially conserved in the

7 Gene Families



In addition to gB and gH, several small glycoprotein genes were identified in 
HCMV, in US (Weston and Barrell 1986). These are arranged tandemly and tend 
to cluster as homologous blocks of reading frames, constituting a large proportion 
of the gene families found in HCMV. Interestingly, the HSV US glycoprotein genes 
are also clustered (Davison and McGeoch 1986; McGeoch et al. 1988a). We 
currently recognize nine sets of homologous genes in the AD 169 genome. There are 
three pairs (UL25 and UL35; UL82 and UL83; and US2 and 3) and six larger 
groups. Of the latter, three occur in US where they account for a total of at least 21 
genes (Weston and Barrell 1986); one family occurs in UL and RL; and two 
families are partitioned between the long and the short regions of the genome 
(Table 1). The discovery of redundant protein coding sequences outside repeat
regions was unexpected and presents a contrast to those single genes encoding 
multiple products (for example, see Sects. 6.4 and 6.5). Their presence also appears to 
contradict the virally frugal gene layout of HCMV. As individual family members 
are likely to have subtle differences in function, this paradox may be difficult to 
resolve. The characteristics of four gene families are discussed below. Proteins have 
been recognized for three of these, while the fourth is homologous to a class of 
cellular receptors. The evolutionary implications of these findings are discussed in 
Sect. 8. 



7.1 The RL11 Family

This family comprises fourteen members distributed in the long repeats and a 
portion of UL adjacent to TRL (Table 1; Fig. 1). The sequences are characterized by
a motif which resembles the cellular Thy-1 in a region which is conserved with some
other members of the immunoglobulin superfamily (C.A. Hutchison III, unpublished
observations). The members of the RL11 family are predicted to be
membrane glycoproteins (Table 3). This prediction has been substantiated by the 
immunological detection of the Towne UL4-equivalent protein in infected cells and 
virions (Chang et al. 1989a). The detected 48-kDa protein is expressed during the
early phase of infection, and its presence in virions has led to its classification as an 
early structural glycoprotein (Chang et al. 1989a). Its published amino acid 
sequence is 84% identical to UL4 over 150 amino acids. Multiple alignment of the 
RL11 family suggests that UL4 (which does not contain an anchor sequence) may be
spliced to UL5 (which has an anchor but no signal or N-glycosylation sites), as their
respective RL11 homologous regions appear to dovetail somewhat. However,
splicing was not observed in transcript mapping experiments (Chang et al. 1989b). 
Nevertheless, Chang et al. (1989a) detect a protein reduced in size from 48 kDa to
27 kDa when infected cells are treated with an inhibitor of N-linked
glycosylation, although the theoretical size of UL4 alone is approximately 17 kDa.
While this difference could be attributable to other post-translational modifications,
it is noteworthy that the theoretical size of RL11, which is homologous to both UL4
and UL5, is approximately 27 kDa. The mapped transcripts, which are initiated from
three different promoters, also contain the UL5 reading frame. Hence it may be of
interest to further characterize the 27-kDa protein. UL8 is truncated similarly to
UL5 and therefore is also a candidate for splicing. As both these frames also contain
Kozak consensus ATG codons, a potential exists for the expression of this gene
family to be regulated in a complex manner.

7.2 The US6 Family

This family corresponds to family 2 described by Weston and Barrell (1986) and is
characterized by two areas of sequence homology, the second of which (region 2;
Weston and Barrell 1986) is less well conserved. The region 1 core motif can be
defined as C(VY)x(DQKR) (7-10) WxxxGxF where the bracketed residues are
alternatives and x is any residue. The region 2 motif is characterized by cysteine and
proline residues: PCxxC (4-6) CxPxxxxPWxP. The six members of this family are
predicted to be membrane glycoproteins (Tables 1 and 3). Gretch et al. (1988b)
have recently used a MAb to demonstrate that this family correlates with the gp47-52
virion envelope glycoprotein complex they described previously (Gretch et al.
1988a). Northern hybridization revealed three early transcripts from this region, two
of which were minor species. The 1.6-kb size of the major transcript was consistent
with initiation from the HCMV-US11 (HXLF1) TATA box, and in vitro translation
experiments suggested it was bicistronic in nature. Gretch et al. (1988a) suggest on
the basis of these data and amino acid composition analysis that the main
constituents of gp47-52 might be HCMV-US10 and US11 proteins. However, no
direct correlation was established between the abundance of the putative transcript 
and the composition of gp47-52. 
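
The two motif definitions given for this family translate readily into regular expressions. The sketch below reads the bracketed numbers, (7-10) and (4-6), as a spacer of that many unspecified residues; that reading is an assumption, as are the function name and the two synthetic peptides, which were built only to satisfy the patterns and are not viral sequences.

```python
import re

# Region 1: C(VY)x(DQKR) (7-10) WxxxGxF
REGION1 = re.compile(r"C[VY].[DQKR].{7,10}W...G.F")
# Region 2: PCxxC (4-6) CxPxxxxPWxP
REGION2 = re.compile(r"PC..C.{4,6}C.P....PW.P")


def us6_regions(protein: str) -> dict[str, bool]:
    """Report which of the two family motifs occur anywhere in the sequence."""
    seq = protein.upper()
    return {
        "region1": REGION1.search(seq) is not None,
        "region2": REGION2.search(seq) is not None,
    }


# Synthetic peptides constructed to fit each pattern (illustration only).
peptide1 = "CVAD" + "A" * 8 + "WAAAGAF"
peptide2 = "PCAAC" + "A" * 5 + "CAPAAAAPWAP"
print(us6_regions(peptide1))             # {'region1': True, 'region2': False}
print(us6_regions(peptide1 + peptide2))  # {'region1': True, 'region2': True}
```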

7.3 The US22 Family

This family is distributed in UL, US and RS and sequences for eight of the thirteen
recognized members have been published, including the family 4 members described
by Weston and Barrell (1986). Genes attributed to this family contain one or more
of three sequence motifs (Kouzarides et al. 1988). The first motif (ooCCxxxLxxoG
where o is any hydrophobic residue and x any residue) is found in all of the members
except IRS1/TRS1 and UL28. Interestingly, in HCMV-UL36 the junction of exons 1
and 2 occurs immediately before the motif (Kouzarides et al. 1988). As HCMV-UL42
ends within the motif (FLCCDKFLPG-COOH), it is possible that this
gene, and perhaps other members of the family apart from HCMV-UL36, encode
spliced transcripts. The remainder of the pattern comprises two motifs which are
largely hydrophobic and may overlap in function. The IRS1/TRS1 genes, identical
over most of their length, diverge shortly after the third motif. Apart from the
conserved motifs, several of these sequences contain short runs of charged residues
in their carboxy-terminal domains, and 6 of the 12 members of the US22 gene family
have at least 1 N-linked glycosylation site. However, there does not appear to be any
obvious correlation between these latter features. The only present correlation
between this gene family and viral proteins comes from the identification of the
HCMV-US22 gene product ICP22. This is an early protein localizing in the nucleus
which is also detectable in the cytoplasm and may be secreted from infected cells
(Mocarski et al. 1988). Interestingly, the MAb used to identify US22 does not
appear to recognize any of its homologs.
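
The first US22 family motif, ooCCxxxLxxoG, can be checked position by position, which also makes it possible to ask whether a reading frame ends part-way through the motif, as described for UL42. The sketch below is illustrative only: the set of residues treated as hydrophobic is an assumption (the text does not list them), the function names are hypothetical, and the two residues appended to the UL42 C-terminus in the second example are invented.

```python
# Residues treated as hydrophobic ("o"); this set is an assumption.
HYDROPHOBIC = set("AVLIMFWY")
MOTIF = list("ooCCxxxLxxoG")  # o = hydrophobic, x = any residue


def residue_fits(residue: str, symbol: str) -> bool:
    """True if a residue satisfies one motif position."""
    if symbol == "x":
        return True
    if symbol == "o":
        return residue in HYDROPHOBIC
    return residue == symbol


def matches_at(seq: str, start: int, length: int = len(MOTIF)) -> bool:
    """True if seq[start:start+length] fits the first `length` motif positions."""
    window = seq[start:start + length]
    return len(window) == length and all(
        residue_fits(r, s) for r, s in zip(window, MOTIF))


def full_matches(seq: str) -> list[int]:
    """1-based start positions of complete motif occurrences."""
    return [i + 1 for i in range(len(seq) - len(MOTIF) + 1) if matches_at(seq, i)]


def c_terminal_prefix_length(seq: str, min_len: int = 4) -> int:
    """Longest proper motif prefix matched by the C-terminus of seq (0 if none).

    min_len=4 requires the match to reach at least the CC pair, so that a
    single terminal hydrophobic residue is not reported as a truncated motif.
    """
    for k in range(min(len(MOTIF) - 1, len(seq)), min_len - 1, -1):
        if matches_at(seq, len(seq) - k, k):
            return k
    return 0


print(c_terminal_prefix_length("FLCCDKFLPG"))  # 10 (C-terminal residues quoted for UL42)
print(full_matches("AAFLCCDKFLPGVGAA"))        # [3] ("VG" is an invented completion)
```

Under these assumptions the quoted UL42 C-terminus covers the first ten of the twelve motif positions, which is the situation that would be expected if the remaining residues were supplied by a downstream exon.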



7.4 The G-Protein Coupled Receptor (GCR) Family 

Several HCMV-reading frames, mostly located in US, are predicted to be integral 
membrane proteins capable of spanning the membrane several times (Table 1). All of
these have seven potential membrane-spanning regions. Three of the reading frames, 
HCMV-US27 and HCMV-US28 (originally named HHRF2 and HHRF3; Weston 
and Barrell 1986), and HCMV-UL33, show homology to the opsin family of cell 
surface receptors (Chee et al., submitted). Members of this diverse family of receptors 

Fig. 4. An alignment of the three HCMV G-protein-coupled receptor homologs with bovine rhodopsin
(Nathans and Hogness 1983), human β-2-adrenergic receptor (B-2-ADR) (Kobilka et al. 1987), and
porcine muscarinic acetylcholine receptor (MAR) (Kubo et al. 1986). The NXT/S motifs are underlined in
the N-terminal extracellular domain and identities which correspond in at least five of the six sequences
are boxed. The seven membrane-spanning helical domains are indicated by numbered bars beneath the
alignment. Each transmembrane domain and its disposition is defined by a motif unique within the
sequence. The alignment has been truncated within the cytoplasmic C-terminal domains which possess
receptor-specific functions, and sections of 30 and 134 amino acids have been excised from the B-2-ADR
and MAR sequences respectively beginning at position 248. The two conserved cysteine residues at
alignment positions 117 and 203 have been shown to be essential for function in bovine rhodopsin
(Karnik et al. 1988)

which unify this highly divergent group of viruses are now coming into focus at the
genetic level. The sequences have facilitated the correlation of biological and genetic
experiments, and allowed much of this work to be generalized. The growing body of 
relational knowledge should make it increasingly informative to begin the 
characterization of herpesvirus genomes by sequencing. These data will continue to 
provide predictions which can be tested, and which promise to shed further light on 
the herpesviruses and their eukaryotic environment. 